# Multimodal Q&A
## LLaVA-1.5-7B-HF Q4_K_M GGUF
A GGUF-format conversion of llava-hf/llava-1.5-7b-hf, supporting image-to-text generation tasks.
Tags: Image-to-Text, English · Author: Marwan02 · Downloads: 30 · Likes: 1

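GGUF conversions like this one are typically run with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the local file names are placeholders (the listing does not name the shipped files), and a LLaVA GGUF needs the separate mmproj vision-projector file alongside the quantized language model.

```python
# Minimal sketch: running a LLaVA-1.5 GGUF with llama-cpp-python.
# File names are placeholders; a LLaVA build ships the quantized language
# model and a separate mmproj (vision projector) file.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-1.5-7b-hf.Q4_K_M.gguf",  # hypothetical local file name
    chat_handler=chat_handler,
    n_ctx=2048,  # leave room for the image embedding plus the answer
)
response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            # An https URL or a base64 data: URI both work here.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])
```
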
## VL-Rethinker-7B MLX 4-bit
License: Apache-2.0
A 4-bit MLX-quantized variant of TIGER-Lab/VL-Rethinker-7B, optimized for Apple devices and supporting visual question-answering tasks.
Tags: Image-to-Text, English · Author: TheCluster · Downloads: 14 · Likes: 0

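On Apple silicon, MLX checkpoints like this one are usually driven through the mlx-vlm package. A minimal sketch under that assumption follows; the repo id is inferred from the listing rather than confirmed, and generate()'s argument order has shifted between mlx-vlm releases, so check your installed version.

```python
# Minimal sketch, assuming the mlx-vlm package (pip install mlx-vlm).
# The repo id below is inferred from this listing and may not match the
# actual hub path; generate()'s signature has changed across releases.
from mlx_vlm import load, generate

model, processor = load("TheCluster/VL-Rethinker-7B-mlx-4bit")  # hypothetical id
prompt = "What is happening in this image?"
output = generate(model, processor, prompt, image="photo.jpg")
print(output)
```
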
## VL-Rethinker-7B FP16
License: Apache-2.0
A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question-answering tasks.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 17 · Likes: 0

## VL-Rethinker-72B 8-bit
License: Apache-2.0
An 8-bit quantization of the VL-Rethinker-72B multimodal vision-language model (derived from Qwen2.5-VL-72B-Instruct), suitable for visual question-answering tasks.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 18 · Likes: 0

## VL-Rethinker-72B 4-bit
License: Apache-2.0
VL-Rethinker-72B-4bit is a multimodal model derived from Qwen2.5-VL-72B-Instruct, supporting visual question-answering tasks and converted to MLX format for efficient operation on Apple devices.
Tags: Image-to-Text, Transformers, English · Author: mlx-community · Downloads: 26 · Likes: 0

## Gemma-3-4B-It-Abliterated Q4_0 GGUF
A GGUF-format conversion of mlabonne/gemma-3-4b-it-abliterated, combined with the vision component of x-ray_alpha for a smoother multimodal experience.
Tags: Image-to-Text · Author: BernTheCreator · Downloads: 160 · Likes: 1

## LLaVAction-7B
LLaVAction is a framework for evaluating and training multimodal large language models for action recognition; it is built on the Qwen2 language-model architecture and supports first-person (egocentric) video understanding.
Tags: Video-to-Text, Transformers, English · Author: MLAdaptiveIntelligence · Downloads: 149 · Likes: 1

## VideoChat-Flash Qwen2.5-7B InternVideo2-1B
License: Apache-2.0
A multimodal video-text model built on InternVideo2-1B and Qwen2.5-7B, using only 16 tokens per frame and supporting input sequences of up to 10,000 frames.
Tags: Video-to-Text, Transformers, English · Author: OpenGVLab · Downloads: 193 · Likes: 4

## Asagi-8B
License: Apache-2.0
Asagi-8B is a large-scale Japanese vision-language model (VLM) trained on extensive Japanese datasets drawn from diverse sources.
Tags: Image-to-Text, Transformers, Japanese · Author: MIL-UT · Downloads: 58 · Likes: 4

## EraX-VL-7B-V2.0-Preview i1 GGUF
License: Apache-2.0
Weighted/importance-matrix (imatrix) quantizations of the EraX-VL-7B-V2.0-Preview model, offered in multiple quantization variants to suit different needs.
Tags: Image-to-Text, Multilingual · Author: mradermacher · Downloads: 246 · Likes: 1

## VideoChat-Flash Qwen2.5-2B Res448
License: Apache-2.0
VideoChat-Flash-2B is a multimodal model built on UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and a context window extended to 128k.
Tags: Video-to-Text, Transformers, English · Author: OpenGVLab · Downloads: 904 · Likes: 18

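The frame-token figures quoted for the two VideoChat-Flash entries above are easy to sanity-check with a little arithmetic:

```python
# Back-of-the-envelope check of the VideoChat-Flash numbers above.
TOKENS_PER_FRAME = 16
CONTEXT_WINDOW = 128_000  # the extended 128k window of the 2B variant

# Frames that fit before any text tokens are spent:
print(CONTEXT_WINDOW // TOKENS_PER_FRAME)  # 8000

# Visual-token cost of the 10,000-frame sequences cited for the 7B variant:
print(10_000 * TOKENS_PER_FRAME)  # 160000
```
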
## EraX-VL-7B-V2.0-Preview
License: Apache-2.0
A powerful multimodal model designed for OCR and visual question answering. It handles multiple languages, Vietnamese in particular, and performs strongly at recognizing medical forms, invoices, and other documents.
Tags: Image-to-Text, Transformers, Multilingual · Author: erax-ai · Downloads: 476 · Likes: 22

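Since EraX-VL-7B derives from the Qwen2-VL family, it should load through the standard Qwen2-VL recipe in Transformers. The sketch below assumes exactly that, plus the companion qwen_vl_utils helper package; verify both, and the repo id, against the model card.

```python
# Minimal OCR/VQA sketch, assuming the standard Qwen2-VL loading recipe.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # companion helper package

model_id = "erax-ai/EraX-VL-7B-V2.0-Preview"  # assumed hub id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.jpg"},
        {"type": "text", "text": "Extract the invoice number and the total amount."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, padding=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
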
## MMAlaya2
License: Apache-2.0
A multimodal model fine-tuned from InternVL-Chat-V1-5 that performs strongly on the MMBench benchmark.
Tags: Image-to-Text · Author: DataCanvas · Downloads: 26 · Likes: 2

## Idefics2-8B-Chatty
License: Apache-2.0
Idefics2 is an open multimodal model that accepts arbitrary sequences of images and text as input and generates text output. It can answer questions about images, describe visual content, create stories grounded in multiple images, or operate as a pure language model.
Tags: Image-to-Text, Transformers, English · Author: HuggingFaceM4 · Downloads: 617 · Likes: 94

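Idefics2 uses the standard Transformers vision-to-sequence interface. A minimal single-image Q&A sketch follows; the image URL is a placeholder.

```python
# Minimal single-image Q&A sketch for Idefics2 via Transformers.
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = load_image("https://example.com/photo.jpg")  # placeholder URL
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
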
## Heron-Chat-GIT Ja-StableLM-Base-7B v1
A vision-language model that can converse about input images, with support for Japanese interaction.
Tags: Image-to-Text, Transformers, Japanese · Author: turing-motors · Downloads: 54 · Likes: 2

## ChatTruth-7B
ChatTruth-7B is a multilingual vision-language model built on the Qwen-VL architecture, enhanced with high-resolution image processing and a restoration module that reduces computational overhead.
Tags: Image-to-Text, Transformers, Multilingual · Author: mingdali · Downloads: 73 · Likes: 13

## Heron-Chat-GIT Ja-StableLM-Base-7B v0
Heron GIT Japanese StableLM Base 7B is a vision-language model capable of conversing about input images.
Tags: Image-to-Text, Transformers, Japanese · Author: turing-motors · Downloads: 57 · Likes: 1

## IDEFICS-9B
License: Other
IDEFICS is an open-source multimodal model that processes both image and text inputs to generate text outputs; it is an open reproduction of DeepMind's Flamingo model.
Tags: Image-to-Text, Transformers, English · Author: HuggingFaceM4 · Downloads: 3,676 · Likes: 46

## Donut-RefExp-Combined-V1
A visual question-answering model focused on understanding user-interface referring expressions.
Tags: Image-to-Text, Transformers, English · Author: ivelin · Downloads: 503 · Likes: 4

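Donut-style checkpoints are queried by feeding the decoder a task-specific prompt prefix. The sketch below uses the standard VisionEncoderDecoder recipe; the exact prompt tokens for this checkpoint are an assumption, so confirm them on the model card.

```python
# Minimal sketch of a Donut-style UI referring-expression query.
# The task-prompt tokens below are hypothetical; Donut checkpoints each
# define their own prefix, so check the model card for the real format.
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

model_id = "ivelin/donut-refexp-combined-v1"  # assumed hub id
processor = DonutProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<s_refexp><s_prompt>select the search button</s_prompt>"  # hypothetical

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)
print(processor.batch_decode(outputs, skip_special_tokens=False)[0])
```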